Windows Server 2012 : Backup and Recovery (part 1) - Disaster-planning strategies, Disaster preparedness procedures

7/9/2013 7:13:30 PM

1. Disaster-planning strategies

Ask three different people what their idea of a disaster is and you’ll probably get three different answers. For most administrators, the term “disaster” probably means any scenario in which one or more essential system services cannot operate and the prospects for quick recovery are less than hopeful—that is, a disaster is something a service reset or system reboot won’t fix.

To ensure that operations can be restored as quickly as possible in a given situation, every network needs a clear disaster recovery plan. Disaster planning draws on many of the same concepts as planning for highly available, scalable, and manageable systems. Why? Because, at the end of the day, disaster planning involves implementing plans that ensure the availability of systems and services. Remember that part of disaster planning is applying some level of contingency planning to every essential network service and system. You need to implement problem-escalation and response procedures. You also need a standing problem-resolution document that describes in great detail what to do when disaster strikes.

Developing contingency procedures

You should identify the services and systems that are essential to network operations. Typically, this list will include the following components:

Network infrastructure servers running Active Directory, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), Remote Desktop Services, and Routing and Remote Access Service (RRAS)
File, database, and application servers, such as servers with essential file shares or those that provide database or email services
Networking hardware, including switches, routers, and firewalls

Combine your availability, scalability, and manageability plans with plans for contingency procedures in the following areas:

Physical security: Place network hardware and servers in a locked, secure-access facility. This could be an office that is kept locked or a server room that requires a passkey to enter. When physical access to network hardware and servers requires special access privileges, you prevent many problems and ensure that only authorized personnel can get access to systems from the console.
Data backup: Implement a regular backup plan that ensures that multiple datasets are available for all essential systems, and that these backups are stored in more than one location. For example, if you keep the most current backup sets on-site in the server room, you should rotate another backup set to off-site storage. In this way, if disaster strikes, you will be more likely to be able to recover operations.
Fault tolerance: Build redundancy into the network and system architecture. At the server level, you can protect data using a redundant array of independent disks (RAID) and guard against component failure by having spare parts on hand. These precautions protect servers at a very basic level.
Recovery: Every essential server and network device should have a written recovery plan that details step by step what to do to rebuild and recover it. Be as detailed and explicit as possible, and don’t assume that the readers know anything about the system or device they are recovering. Do this even if you are sure that you’ll be the one performing the recovery; you’ll be thankful for it, trust me. Things can and do go wrong at the worst times, and under pressure you might forget some important detail in the recovery process. You might also be unavailable to recover the system for some reason.
Power protection: Power-protect servers and network hardware using an uninterruptible power supply (UPS) system. Power protection will help safeguard servers and network hardware from power surges and dirty power. Power protection will also help prevent data loss and allow you to power down servers in an appropriate fashion through manual or automatic shutdown.

Putting in a UPS requires a bit of planning, because you need to look not only at servers but also at everything in the server room that requires power. If the power goes out, you want to have ample time for systems to shut down in an orderly fashion. You might also have some systems that you do not want to be shut down, such as routers or servers required for security key cards. In most cases, rather than using individual UPS devices, you should install enterprise UPS solutions that can be connected to several servers or components.

After you install a UPS, you can configure servers to take advantage of it by using the management software included with the UPS. You can then configure the way a server reacts when it switches to battery power. Typically, you’ll want servers to start an orderly shutdown within a few minutes of switching to battery power.

In your planning, remember that 90 percent of power outages last less than 5 minutes and 99 percent of power outages last less than 60 minutes. With this in mind, you might want to plan your UPS implementation so that you can maintain 7 to 10 minutes of power for all server and network components and 60 to 70 minutes for critical systems. You would then configure all noncritical systems to shut down automatically after 5 minutes and configure critical systems to shut down after 60 minutes.
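
The shutdown policy described above can be checked with a few lines of code. The following Python sketch is only an illustration: the `System` class, the host names, and the runtime figures are hypothetical, and the 5-minute and 60-minute delays are simply the example values from the text. It reports which systems have too little battery runtime to reach their configured shutdown delay.

```python
from dataclasses import dataclass

# Shutdown delays taken from the example in the text; adjust for your site.
NONCRITICAL_SHUTDOWN_MIN = 5
CRITICAL_SHUTDOWN_MIN = 60


@dataclass
class System:
    name: str               # hypothetical host name
    critical: bool          # True for systems that must ride out longer outages
    ups_runtime_min: float  # battery runtime available to this system, in minutes


def shutdown_delay(system: System) -> int:
    """Minutes on battery before the system starts an orderly shutdown."""
    return CRITICAL_SHUTDOWN_MIN if system.critical else NONCRITICAL_SHUTDOWN_MIN


def runtime_margin(system: System) -> float:
    """Battery minutes left over once the shutdown has been triggered.

    A margin of zero or less means the battery could run out before the
    shutdown starts, so the UPS is undersized for this policy.
    """
    return system.ups_runtime_min - shutdown_delay(system)


def undersized(systems: list[System]) -> list[str]:
    """Names of systems whose UPS runtime leaves no margin for shutdown."""
    return [s.name for s in systems if runtime_margin(s) <= 0]


if __name__ == "__main__":
    inventory = [
        System("file01", critical=False, ups_runtime_min=8),
        System("dc01", critical=True, ups_runtime_min=65),
        System("sql01", critical=True, ups_runtime_min=45),
    ]
    print(undersized(inventory))
```

In this made-up inventory, the critical server with only 45 minutes of runtime is flagged, because it would need to stay up for 60 minutes before shutting down. The margin that remains after the delay is the time the server has to complete its shutdown, which is why the text suggests 7 to 10 minutes of power for a 5-minute delay.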

Implementing problem-escalation and response procedures

As part of planning, you need to develop well-defined problem-escalation procedures that document how to handle problems and emergency changes that might be needed. You need to designate an incident response team and an emergency response team. Although the two teams could consist of the same team members, the teams differ in fundamental ways:

Incident response team: The incident response team’s role is to respond to security incidents, such as the suspected cracking of a database server. This team is concerned with responding to an intrusion, taking immediate action to safeguard the organization’s information, documenting the security issue thoroughly in an after-action report, and then fixing the security problem so that the same type of incident cannot recur. Your organization’s security administrator or network security expert should have a key role in this team.
Emergency response team: The emergency response team’s role is to respond to service and system outages, such as the failure of a database server. This team is concerned with recovering the service or system as quickly as possible and allowing normal operations to resume. Like the incident response team, the emergency response team needs to document the outage thoroughly in an after-action report, and then, if applicable, propose changes to improve the recovery process. Your organization’s system administrators should have key roles in this team.

Creating a problem-resolution policy document

Over the years, I’ve worked with and consulted for many organizations, and I’ve often been asked to help implement information technology (IT) policies and procedures. In the area of disaster and recovery planning, there’s one policy document that I always use, regardless of the size of the company I am working with. I call it the problem-resolution policy document.

The problem-resolution policy document has the following six sections:

Responsibilities: The overall responsibilities of IT and engineering staff during and after normal business hours should be detailed in this section. For an organization with 24/7 operations, such as a company with a public World Wide Web site maintained by internal staff, the after-hours responsibilities section should be very detailed and let individuals know exactly what their responsibilities are. Most organizations with 24/7 operations will designate individuals as being “on call” 7 days a week, 365 days a year, and in that case, this section should detail what being “on call” means and what the general responsibilities are for an individual on call.
Phone roster: Every system and service you identify in your planning as essential should have a point of contact. For some systems, you’ll have several points of contact. Consider, for example, a database server. You might have a system administrator who is responsible for the server itself, a database administrator who is responsible for the database running on the server, and an integration specialist responsible for any integration components running on the server.

Important

The phone roster should include both on-site and off-site contact numbers. Ideally, this means that you’ll have the work phone number, cell phone number, and pager number of each contact. It should be the responsibility of every individual on the phone roster to ensure that contact information is up to date.
Key contact information: In addition to a phone roster, you should have contact numbers for facilities and vendors. The key contacts list should include the main office phone numbers at branch offices and data centers and contact numbers for the various vendors that installed infrastructure at each office, such as the building manager, Internet service provider (ISP), electrician, and network wiring specialist. It should also include the support phone numbers for hardware and software vendors and the information you’ll be required to give in order to get service, such as customer identification number and service contract information.
Notification procedures: The way problems get resolved is through notification. This section should outline the notification procedures and the primary point of contact in case of outage. If many systems and services are involved, notification and primary contacts can be divided into categories. For example, you might have an external systems-notification process for your public Internet servers and an internal systems-notification process for your intranet services.
Escalation: When problems aren’t resolved within a specific timeframe, there should be clear escalation procedures that detail whom to contact and when. For example, you might have level 1, level 2, and level 3 points of contact, with level 1 contacts being called immediately, level 2 contacts being called when issues aren’t resolved in 30 minutes, and level 3 contacts being called when issues aren’t resolved in 60 minutes.

Important

You should also have a priority system in place that dictates what types of incidents or outages take precedence over others. For example, you could specify that service-level outages, such as those that involve the complete system, have priority over an isolated outage involving a single server or application, but that suspected security incidents have priority over all other issues.
Post-action reporting: Every individual involved in a major outage or incident should be expected to write a post-action report. This section details what should be in that report. For example, you would want to track the notification time, actions taken after notification, escalation attempts, and other items that are important to improving the process or preventing the problem from recurring.
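
The escalation tiers and the priority system described in this list lend themselves to a simple rule table. The Python sketch below is a hypothetical illustration, not part of any product: the names `levels_to_notify`, `work_order`, and `Priority` are invented here, and the 30-minute and 60-minute thresholds are the example values from the escalation section.

```python
from enum import IntEnum

# Thresholds from the example in the text: level 1 is called immediately,
# level 2 after 30 unresolved minutes, and level 3 after 60.
ESCALATION_THRESHOLDS_MIN = {1: 0, 2: 30, 3: 60}


class Priority(IntEnum):
    """Higher values are handled first; the ordering follows the text."""
    ISOLATED_OUTAGE = 1    # a single server or application
    SERVICE_OUTAGE = 2     # a complete system or service
    SECURITY_INCIDENT = 3  # a suspected security incident


def levels_to_notify(minutes_unresolved: float) -> list[int]:
    """Escalation levels that should have been called by this point."""
    return [
        level
        for level, threshold in sorted(ESCALATION_THRESHOLDS_MIN.items())
        if minutes_unresolved >= threshold
    ]


def work_order(open_issues: list[tuple[str, Priority]]) -> list[str]:
    """Issue names ordered from highest to lowest priority."""
    ranked = sorted(open_issues, key=lambda issue: issue[1], reverse=True)
    return [name for name, _ in ranked]
```

For example, `levels_to_notify(45)` yields levels 1 and 2, and `work_order` places a suspected security incident ahead of any outage. Keeping the thresholds in one table makes it easy to keep the code and the written policy document in agreement.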

Every IT group should have a general policy with regard to problem-resolution procedures, and this policy should be detailed in a problem-resolution policy document or one like it. The document should be distributed to all relevant personnel throughout the organization so that every person who has some level of responsibility for ensuring system and service availability knows what to do in the case of an emergency. After you implement the policy, you should test it to help refine it so that the policy will work as expected in an actual disaster.

2. Disaster preparedness procedures

Just as you need to perform planning before disaster strikes, you also need to perform certain predisaster preparation procedures. These procedures ensure that you are able to recover systems as quickly as possible when a disaster strikes and include the following:

Backups
Startup repair
Recovery disks
Startup and recovery options
Recovery Console

Performing backups

You should perform regular backups of every server. Backups can be performed using several techniques. Most organizations choose a combination of dedicated backup servers and per-server backups. If you use professional backup software, you can use one or more dedicated backup servers to create backups of other servers on the network, and then write the backups to media on centralized backup devices. If you use per-server backups, you run backup software on each server that you want to back up and store the backup media on a local backup device. By combining the techniques, you get the best of both worlds.

With dedicated backup servers, you purchase professional backup software, a backup server, and a scalable backup device. The initial costs for purchasing the required equipment and the time required to set up the backup environment can be substantial. However, after the backup environment is configured, it is rather easy to maintain. Centralized backups also offer substantial time savings for administrators because the backup process itself can be fully automated.
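
For per-server backups, one option not covered above is to script the `wbadmin` command-line tool that accompanies Windows Server Backup. The following Python sketch only assembles the command line; the helper names, the target drive, and the volume list are hypothetical, and you should confirm the switches against the `wbadmin start backup` help on your own server before relying on them.

```python
import subprocess


def build_backup_command(target: str, volumes: list[str],
                         include_critical: bool = True) -> list[str]:
    """Build a `wbadmin start backup` argument list for a one-time backup.

    target  -- backup destination, such as "E:" or a UNC share
    volumes -- volumes to include, such as ["C:", "D:"]
    """
    if not volumes and not include_critical:
        raise ValueError("nothing selected for backup")
    command = ["wbadmin", "start", "backup", f"-backupTarget:{target}"]
    if volumes:
        command.append("-include:" + ",".join(volumes))
    if include_critical:
        command.append("-allCritical")  # add all volumes that hold system components
    command.append("-quiet")            # run without prompting for confirmation
    return command


def run_backup(command: list[str]) -> int:
    """Run the backup and return its exit code.

    This must run from an elevated session on a Windows server that has
    the Windows Server Backup feature installed.
    """
    return subprocess.run(command, check=False).returncode
```

Separating command construction from execution lets you review or log the exact command before anything runs, and a scheduler on each server can then call `run_backup` as part of the regular backup plan.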

Repairing startup

Like its predecessors, Windows Server 2012 has several automatic repair features. If the boot manager or a corrupted system file is preventing startup, the Startup Repair tool is started automatically and will initiate the repair of the server. The Startup Repair tool can be helpful if one or more of the following problems are preventing startup:

A virus infection in the master boot record
A missing or corrupt boot manager
A boot configuration data store with bad entries
A corrupted system file

Although Startup Repair typically runs automatically, you can manually initiate this feature by completing the following steps:

If the computer won’t start normally, you’ll see a Windows Boot Manager error screen stating that Windows failed to start. Press Enter.
On the OS Selection screen, press F8.
On the Advanced Boot Options screen, choose an appropriate safe mode or other alternate mode to try to start the server so that you can log in to diagnose and resolve the problem.

You also can use the installation disc to initiate recovery. To do so, follow these steps:

Insert the Windows installation disc, and then boot from it by pressing a key when prompted during startup. If the server does not allow you to boot from the installation disc, you might need to change firmware options to allow booting from a CD/DVD-ROM drive.
Windows Setup should start automatically. On the Install Windows page, select the language, time, and keyboard layout options that you want to use. Tap or click Next.
When prompted, do not tap or click Install Now. Instead, tap or click the Repair Your Computer link in the lower left corner of the Install Windows page.
On the Recovery screen, tap or click Troubleshoot. Then, on the Advanced Options screen, tap or click Command Prompt to access the MINWINPC environment.
Change directories to x:\sources\recovery by typing cd recovery.
Run the Startup Repair Wizard by typing startrep.

You can recover a server’s operating system or perform a full system recovery by using a Windows installation disc and a backup that you created earlier with Windows Server Backup. To initiate a recovery, on the Recovery screen, tap or click Troubleshoot. Then, on the Advanced Options screen, tap or click System Image Recovery.

With an operating system recovery, you recover all critical volumes but do not recover nonsystem volumes. If you recover a full system, Windows Server Backup reformats and repartitions all disks that are attached to the server. Because of this, you should use this method only when you want to recover the server data onto separate hardware or when all other attempts to recover the server on the existing hardware have failed.

Setting startup and recovery options

As part of planning for the worst-case scenarios, you need to consider how you want systems to start up and recover if a stop error is encountered. The options you choose can add to the boot time, or they can specify that a system does not reboot after it encounters a stop error.

You can configure startup and recovery options by completing the following steps:

In the Control Panel, tap or click System And Security\System to start the System utility.
Tap or click Advanced System Settings. This opens the System Properties dialog box.
On the Advanced tab, tap or click Settings in the Startup And Recovery panel. This displays the dialog box shown in Figure 1.

Figure 1. Configuring startup and recovery options.
In the Startup And Recovery dialog box, you configure the settings as follows:
- If a server has multiple operating systems, you can set the default operating system by selecting one of the operating systems in the Default Operating System list. These options are obtained from the boot manager.
- When multiple operating systems are installed, the Time To Display List Of Operating Systems option controls how long the system waits before booting to the default operating system. In most cases, you won’t need more than a few seconds to make a choice, so reduce this wait time to perhaps 5 or 10 seconds. Alternatively, you can have the system automatically choose the default operating system by clearing this option.
- When you want to display recovery options, the operating system uses the Time To Display Recovery Options When Needed setting to determine how long to wait for you to choose a recovery option. The default wait time is 30 seconds. If you don’t choose a recovery option in that time, the system boots normally without recovery. As with operating systems, you won’t need more than a few seconds to make a choice, so reduce this wait time to perhaps 5 or 10 seconds.
- Under System Failure, you have several important options for determining what happens when a system experiences a stop error. By default, the Write An Event To The System Log check box is selected so that the system logs an error in the system log. The check box appears dimmed, so it cannot normally be changed. The Automatically Restart check box is selected to ensure that the system attempts to reboot when a stop error occurs.
  
  Important
  
  In some cases, you might want the system to halt rather than reboot. For example, if you are having problems with a server, you might want it to halt so that an administrator will be more likely to notice that it is experiencing problems. Don’t, however, prevent automatic reboot without a specific reason.
- The Write Debugging Information options allow you to choose the type of debugging information that should be created when a stop error occurs. In most cases, you will want debug information to be dumped so that you can use it to determine the cause of a crash.
  
  Important
  
  If you choose a complete memory dump, you dump all physical memory being used at the time of the failure. A kernel memory dump, by contrast, records only kernel-mode memory and is smaller. You can create the dump file only if the system is properly configured. For a complete memory dump, the system drive must have a paging file at least as large as RAM and adequate disk space to write the dump file.
- By default, dump files are written to the %SystemRoot% folder. If you want to write the dump file to a different location, type the file path in the Dump File box. Select the Overwrite Any Existing File option to ensure that only one dump file is maintained.
Tap or click OK twice to close all open dialog boxes.
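
The System Failure settings in this dialog box are stored in the registry, which makes them easy to audit across servers. The Python sketch below assumes the values live under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl` with the value names shown; these names do not come from the text above, so verify them on your own system. The `describe` and `read_crash_control` helpers are hypothetical names.

```python
import sys

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# CrashDumpEnabled values and the dump type each one selects.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
    7: "automatic memory dump",
}


def describe(settings: dict) -> dict:
    """Translate raw CrashControl values into readable settings."""
    return {
        "write_event_to_system_log": bool(settings.get("LogEvent", 0)),
        "automatically_restart": bool(settings.get("AutoReboot", 0)),
        "dump_type": DUMP_TYPES.get(settings.get("CrashDumpEnabled"), "unknown"),
        "dump_file": settings.get("DumpFile"),
        "overwrite_existing_file": bool(settings.get("Overwrite", 0)),
    }


def read_crash_control() -> dict:
    """Read the live values from the registry; works only on Windows."""
    if sys.platform != "win32":
        raise OSError("CrashControl settings exist only on Windows")
    import winreg  # imported here because the module is Windows-only

    names = ("LogEvent", "AutoReboot", "CrashDumpEnabled", "DumpFile", "Overwrite")
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in names:
            try:
                values[name], _ = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; describe() falls back to a default
    return values
```

On a server, `describe(read_crash_control())` returns a dictionary you can compare against your documented standard, for example to confirm that automatic restart is still enabled on every machine. The sketch only reads the settings; changes are best made through the dialog box described above.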